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ABSTRACT 

A method is presented for performing joint analyses of cosmological datasets, in which the 
weight assigned to each dataset is determined directly by it own statistical properties. The 
weights are considered in a Bayesian context as a set of hyperparameters, which are then 
marginalised over in order to recover the posterior distribution as a function only of the cos- 
mological parameters of interest. In the case of a Gaussian likelihood function, this marginal- 
isation may be performed analytically. Calculation of the Bayesian evidence for the data, with 
and without the introduction of hyperparameters, enables a direct determination of whether 
the data warrant the introduction of weights into the analysis; this generalises the standard 
likelihood ratio approach to model comparison. The method is illustrated by application to 
the classic toy problem of fitting a straight line to a set of data. A cosmological illustration 
of the technique is also presented, in which the latest measurements of the cosmic microwave 
background power spectrum are used to infer constraints on cosmological parameters. 

Key words: cosmic microwave background - methods: data analysis - methods: statistical. 



1 INTRODUCTION 



It is now common practice in cosmology to estimate the values of 
cosmological parameters by a joint analysis of a number of differ- 
ent datasets. The standard technique for performing such an analy- 
sis is to assume that the datasets are statistically independent and so 
take the joint likelihood function for the parameters to be given sim- 
ply by the product of the individual likelihood functions for each 
separate dataset. The joint likelihood function can then be used in 
the standard way to determine the optimal values of the parameters 
and their associated errors. 

As discussed by Lahav et al. (2000; hereinafter Paper I), how- 
ever, there exists some freedom in the relative 'weight' that may be 
given to each dataset in the analysis (see also Godwin & Lynden- 
Bell 1987; Press 1996). The assignment of weights often occurs 
when two or more of the datasets are inconsistent, and is usually 
made in a somewhat ad-hoc manner. Typically some datasets are 
excluded from the analysis, and hence given a weight of zero, while 
the remainder are analysed jointly with equal weights. Despite its 
widespread use, this procedure has many unsatisfactory features, 
not least of which is the subjectivity associated with the choice of 
which datasets to include, and which to discard. As advocated in 
Paper I, a more objective procedure for assigning weights to the 
datasets is provided by the introduction of hyperparameters. This 
device allows the statistical properties of the data themselves to de- 
termine the weight given to each dataset in the analysis. 

In Paper I, a method was presented for introducing hyperpa- 
rameters into the analysis of datasets for which each likelihood 



function had a particular simple form. This technique was then 
applied to the estimation of the Hubble parameter h from several 
sets of observations of the power spectrum of cosmic microwave 
background (CMB) anisotropies. It was shown that apparently dis- 
crepant datasets could be analysed jointly to provide a consistent 
estimate of h, together with an associated error. This approach has 
also recently been applied to the joint analysis of the baryon mass 
fraction in clusters and cepheid-calibrated distances (Erdogdu, Et- 
tori & Lahav 2002). 

In this paper, we extend the work of Paper I to accommodate 
more general situations. In section ^, we review the standard ap- 
proach to Bayesian parameter estimation and discuss the role of 
the Bayesian evidence in model selection. In section^ we present 
a general account of hyperparameters and their use in joint esti- 
mation of parameters. We also discuss how to use the Bayesian 
evidence to decide whether or not the data support the inclusion 
of hyperparameters in the first instance. In section ^, we consider 
the use of hyperparameters in the weighting of datasets and, in sec- 
tion^, we discuss the common special case in which the likelihood 
function for each dataset is Gaussian. In section ^, we illustrate 
the useful properties of the hyperparameters approach by applying 
the technique to the toy problem of fitting a straight-line to a num- 
ber of datasets. A cosmological application of the hyperparameters 
approach is discussed in section Q, in which we perform a joint 
estimate of the physical baryon density and the scalar spec- 
tral index n using the most recent sets of observations of the CMB 
power spectrum. Finally, our conclusions are presented in section^. 
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2 B AYES' THEOREM AND EVIDENCE 

Suppose the totality of our data is represented by the data vector 
D and we are interested in estimating the values of the parameters 
6 in some underlying model of the data. The standard approach to 
this problem is to use Bayes' theorem 

Pr(.D|0)Pr(0) 



Pr(0\D) 



Pr(£>) 



(1) 



which gives the posterior distribution Pr(0|D) in terms of the like- 
lihood Pr(D\0), the prior Pr(6) and the evidence Pr(D) (which is 
also often called the marginalised likelihood). 

2.1 Parameter estimation 

For the purpose of estimating parameters, one usually ignores the 
normalisation factor ~Pi(D) in Bayes' theorem, since it does not 
depend on the parameters 6. Thus, one instead works with the 'un- 
normalised posterior' 



Pr(0|.D) = Pr(.D|0)Pr(0) 



(2) 



where we have written Pr to denote the fact that the 'probability 
distribution' on the left-hand side is not normalised to unit volume. 
In fact, it is also common to omit normalising factors, that do not 
depend on the parameters 6, from the likelihood and the prior. As 
we shall see below, however, if one wishes to calculate the Bayesian 
evidence for a particular model, the likelihood and the prior must be 
properly normalised such that J Pr(6) d6 = 1 and [Pr(D\6)dD = 
1. We will therefore assume here that the necessary normalising 
factors have been retained. 

Strictly speaking, the entire (unnormalised) posterior is the 
Bayesian inference of the parameters values. Unfortunately, if the 
dimension M of the parameter space is large, it is often numeri- 
cally unfeasible to calculate Pr(0|_D) on some M-dimensional hy- 
percube. Thus, particularly in large problems, it is usual to present 
one's results in terms of the 'best' estimates 6, which maximise 
the (unnormalised) posterior, together with some associated errors. 
These errors are usually quoted in terms of the estimated covariance 
matrix 

C=[-VVhx¥i(e\D)\ g=6 ]- 1 . (3) 

or as confidence limits on each parameter a,- (i = 1,...,M), ob- 
tained from the one-dimensional marginalised (unnormalised) pos- 
terior distributions 



Pr(ai|X>) = J Pr{6\D)d6, 



(4) 



where d6 = da\ ■ ■ ■ d&i ■ ■ ■ daM denotes that the integration is per- 
formed over all other parameters aj (j ^ i). 

The estimates 6 are most often obtained by an iterative nu- 
merical minimisation algorithm. Indeed, standard numerical algo- 
rithms are generally able to locate a local (and sometimes global) 
maximum of this function even in a space of large dimensional- 
ity. Similarly, the covariance matrix of the errors can be found 
straightforwardly by first numerically evaluating the Hessian ma- 
trix VVPr(0| D) at the peak 6 = 6, and then calculating (minus) its 



2.2 Bayesian evidence and model selection 

The standard technique outlined above produces inferences of the 
parameter values for a given model of the data, but it does not pro- 
vide a mechanism for deciding which one of a set of alternative 



models is most suitable for describing the data. This problem may 
be addressed, however, using the Bayesian evidence Px(D). Very 
readable introductions to this topic are given by Bishop (1995) and 
Sivia(1996). 

Although the evidence term is usually ignored in the process 
of parameter estimation, it is central to selecting between different 
models for the data. For illustration, let us suppose we have two 
alternative models (or hypotheses) for the data D; these hypothe- 
ses are traditionally denoted by Hq and H i . Let us assume further 
that the model Hq is characterised by the parameter set 6, whereas 
H\ is described by the set of parameters 0. For the model Hq, the 
probability density for an observed data vector D is given by 



Pr(£)|// )= Pi(D\6)Pr(6)d6, 



(5) 



where, on the left-hand side, we have made explicit the condition- 
ing on Hq. Similarly, for the model Hi , 



Pr(D\H\) = J Pr(D\4>)Pr(4>)d4>. 



(6) 



In either case, we see that the evidence is given by the average of 
the likelihood function with respect to the prior. Thus, a model will 
have a larger evidence if more of its allowed parameter space is 
likely, given the data. Conversely, a model will have a small ev- 
idence if there exist large areas of the allowed parameter space 
with low values of the likelihood, even if the likelihood function 
is strongly peaked and the corresponding model predictions agree 
closely with the data. Hence the value of the evidence naturally 
incorporates the spirit of Ockham's razor: a simpler theory, hav- 
ing a more compact parameter space, will generally have a larger 
evidence than a more complicated theory, unless the latter is signif- 
icantly better at explaining the data. 

The question of which is the models Hq and H\ is prefered is 
thus answered simply by comparing the relative values of the evi- 
dences Pi(D\Hq) and Pr(D\H\). The hypothesis having the larger 
evidence is the one that should be accepted. Although not widely 
used in cosmology, the idea of model selection using evidence ra- 
tios has been considered previously in this context by laffe (1996) 
and, more recently, by John & Narlikar (2001). 

2.3 The Gaussian approximation to the posterior 

Unfortunately, the evaluation of an evidence integral, such as <Jsj> , 
is a challenging numerical task. From fl2J), we see that 



?t{D\Hq) = J Pi(6\D)d6, 



(7) 



and so the evidence may only be evaluated directly if one can cal- 
culate Pr(6\D) over some hypercube in parameter space, which we 
noted earlier is often computational unfeasible. Nevertheless, if the 
data are conclusive, we would expect this (unnormalised) posterior 
to be sharply peaked about the position of the maximum 6 = 6. 
In this case, we may approximate this function by performing a 
Taylor expansion about 6 = 6. Working with the log-posterior, and 
keeping only terms up to second-order in 6, leads to the Gaussian 
approximation of the (unnormalised) posterior distribution, which 
reads 

Pl(d\D)^P~r{6\D)exp[-±(6-6) T C- l (6-6)j , (8) 

where C is the estimated inverse covariance matrix of the pa- 
rameters and is given by 

C _1 = - VVlnPr(0|.D)Lg = - VVlnPr(0|.D) 



0=0 • 



(9) 
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Substituting the form (Q) into (^), we thus find that an approxima- 
tion to the value of the evidence is given by 

Pr(D\H ) « (27i) M / 2 |C| 1 / 2 Pr(0)Pr(£>|0) (10) 

where M is the number of parameters of interest 6 and we have 
rewritten Pr(#| D) using (Q). Since the estimation of parameter val- 
ues and their associated errors already requires one to calculate all 
the quantities on the right-hand side of (llOp, we see that this ap- 
proximate evidence may be evaluated with no extra work. We note, 
however, that for < fic| ) to hold, the prior and likelihood must be cor- 
rectly normalised, such that / Pr(0) dG = 1 and /Pr(D|0)t/D = 1. 
We also note, in this case, that consideration of the ratio of evi- 
dences is a natural generalisation of the standard likelihood-ratio 
approach to model comparison. 



2.4 Markov-Chain Monte-Carlo methods 

We note, in passing, that the approach to parameter estimation 
and the approximate evaluation of evidences outlined above may 
soon become obsolete. With the advent of faster computers and ef- 
ficient algorithms, it has recently become numerically feasible to 
sample directly from the posterior distribution using Monte-Carlo 
Markov-Chain (MCMC) techniques (see Knox, Christensen & 
Skordis 2001). This allows one trivially to obtain one-dimensional 
marginalised posteriors for each parameter of interest a;, and 
also enables the direct numerical evaluation of evidence integrals. 
Clearly, the MCMC technique has enormous potential for the future 
estimation of cosmological parameters. 



3 HYPERPARAMETERS 

In this paper, we wish to construct a robust technique for perform- 
ing a joint estimation of cosmological parameters from combined 
datasets. The basic idea behind the approach presented here is to in- 
troduce additional hyperparameters a. into the Bayesian inference 
problem. In other words, we extend our parameter vector to include 
not only the parameters of interest 8, but also the hyperparameters 
a. These hyperparameters are analogous to 'nuisance' parameters 
that often arise in the standard approach to parameter estimation. 
In our case, however, they are not present in our model a priori, but 
we have chosen to introduce them in order to allow extra freedom 
in the parameter estimation process. 



3.1 Marginalisation over hyperparameters 

As with any set of nuisance parameters, we must integrate out (or 
marginalise over) the hyperparameters a in order to recover the 
posterior distribution of our parameters of interest 6. Thus, we have 



(13) 



Pr(6>|i?) 



Pr(6,a\D)da 
1 



Pr(Z») 



Vr{D\6,a)Vr(6,cx)da. 



(11) 



In this paper, we shall assume that the parameters of interest 6 and 
the hyperparameters a are independent, so that 



Pr(0,a) = Pr(0)Pr(a) 



(12) 



Substituting this form of the prior into (|l l[), we immediately re- 
cover Bayes' theorem ([!]), where the form of the likelihood function 
in the presence of hyperparameters is 



Pr(D\0) = J Pr(D\6,a)Pr{a)da. 

Indeed, under the assumption (|l2|), this expression for the likeli- 
hood embodies the complete hyperparameters technique. 

Let us now turn our attention to the form of this likelihood 
function. Assuming that the totality of data D consists of N inde- 
pendent datasets D k (k = 1, .. . ,N), we have 



Pr(D\e,cx) = l\Pi(D k \e,cx) 



(14) 



Moreover, in this paper, we will make the additional simplifying 
assumption that the kth term in the product on the right-hand side 
of (|l4J) depends only on oc^. In this case, it reduces to 

N 

Pv(D\e, a ) = l\P I (D k \e,a k ). (15) 

k=\ 

We shall also assume that the individual hyperparameters a,- 
= l,...,N) are themselves independent, so that Pr(a) = 
Pr((Xi)Pr(a2) • • -Pr(aAr). Thus the expression ( |l3| ) for the likeli- 
hood becomes 



N , 

Pr(D\e) = / Pr(D k \0,a k )Pt{a k )da k , 



(16) 



which is simply the product of the individual likelihoods P(D/ ( \0) 
for each dataset, after marginalisation over the hyperparameter. 
This form can be substituted into Bayes' theorem M to obtain the 
posterior Pi(G\D), which can then be used to obtain the best esti- 
mates 6 of the parameters and their associated errors. 

Finally, it is worth noting that we may use ( |lo"| ) to write the 
standard approach to parameter estimation, in which no hyperpa- 
rameters are introduced, as a special case of the hyperparameters 
technique. This is achieved by fixing the hyperparameters CC& to 
have particular values oc^ (k = l,...,N), which is easily accommo- 
dated in the above formalism by assigning the priors 



Pr(a*) = 8(a£-a£), 



(17) 



where 8(z) is the Dirac delta function. Most often, the hyperpa- 
rameters are introduced in such a way that oc^ = 1 (for all k) cor- 
responds to the standard approach, in which hyperparameters are 
absent. In other words, the standard likelihood function for the k\h 
dataset may be denoted by Pi(D k \0, a k = 1). 



3.2 Hyperparameters and evidence 

So far, we have not addressed the question of whether the data 
support the introduction of hyperparameters in the first instance. 
For example, if a collection of different datasets are all mutually- 
consistent, then one might not consider it wise to introduce the 
hyperparameters ex. Indeed, their introduction could lead to larger 
uncertainties on the estimated values of the parameters of interest 
0, since one has to perform a marginalisation over a. On the other 
hand, if several of the datasets are not in good agreement, the use of 
hyperparameters is necessary, in order to obtain statistically mean- 
ingful results. 

In fact, Bayes' theorem itself allows us to make an objective 
decision regarding whether the data warrant the introduction of hy- 
perparameters. As discussed in section 2.2, this may be achieved 



using the Bayesian evidence Pr(D). For illustration, let us consider 
two models for the data as follows: the model Hq does not include 
hyperparameters, whereas the model H\ assigns a free hyperparam- 
eter to each dataset. In either case, the probability of obtaining the 
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observed data D is 



tdO 



Pr(£>) = I Pr{D,G,a)dat 

= I Pr{D\G,a)Pi(6,a)dadG 



Pr(£>|0)Pr(0)rf0, 



(18) 



where, in the last line, the likelihood Pr(D\0) is given by (|l6|). 
For Hq, the priors are simply Pr(a k ) = 8(0^ — 1) and the integral 
simplifies accordingly, whereas for H\ the priors will, in general, 
have some more complicated form. 

Denoting the resulting evidence values for our two hypotheses 
by Vx{D\Hq) and Pi(D\H\) respectively, the question of whether 
the data warrant the introduction of hyperparameters, is now an- 
swered simply by comparing the relative values of Pi(D\Hq) and 
Pt(D\H\). The hypothesis having the larger evidence is the one 
that should be accepted. If Pi(D\Hi) < Pi(D\H ), this indicates 
that the datasets are all mutually consistent and that the introduc- 
tion of hyperparameters is not warranted. Conversely, the condition 
Pr(D\H\ ) Pt(D\Ho) is strong indication that the datasets are not 
mutually consistent, and that hyperparameters are necessary in or- 
der to obtain meaningful statistical results. 



4 WEIGHTING OF DATASETS 

In the above discussion, we have specialised to the case where 
a single hyperparameter is associated with each dataset. We have 
not yet, however, made explicit how this hyperparameter enters the 
form of the modified likelihood Pi(D\9 ,u k ) for each dataset. Nei- 
ther have we fixed the form of the prior Pr(ot^) on each hyperpa- 
rameter. As mentioned in the Introduction, an obvious use of hyper- 
parameters in cosmological parameter estimation is in the weight- 
ing of the different datasets being analysed. We now consider this 
application in more detail. 

4.1 The prior on each hyperparameter 

Let us first turn our attention to the prior Pi(u k ) on each hyper- 
parameter. As discussed in Paper I, since a k is a scale parameter, 
we might adopt the 'non-informative' or Jeffrey's prior, which is 
uniform in the logarithm of the parameter. Thus, we have 

5(0*) = -J-, (19) 

(or, more generally, one might take Pr(oc) = 1 /a", where n is a non- 
negative integer). The Jeffrey's prior does, however, cause some 
difficulties, since it is improper and so cannot be normalised (as 
indicated). This is not a problem for obtaining estimates of the pa- 
rameter values 6 and their associated errors, since these quantities 
do not depend on the overall normalisation of the posterior. How- 
ever, the use of an improper prior makes it impossible to calculate 
evidences, and so we are unable to assess the whether the data war- 
rant the introduction of the hyperparameters. 

We must ask ourselves, however, whether we are truly igno- 
rant of the value of each hyperparameter cx^. In fact, this is rather 
unlikely. Given the nature of the data-weighting problem, we might 
suppose that the expectation value of oij is unity. In other words, in 
the first instance, we expect that the experimental data have been 
correctly analysed and that the results are free from any system- 
atic bias or misquoted random errors. Thus, we need to assign a 
prior probability distribution Pi((X k ) subject to the constraint that 



E[a k ] = 1. As shown by Jaynes (1957a,b), building on the work 
of Shannon (1948), the only consistent way of assigning the prior 
probability distribution Pr(a^) is by maximising the 'entropy' func- 
tional 



S{Pr(a k )] 



Pr(a k )lnPt(a k )da k , 



(20) 



subject to the normalisation constraint J^ > Pi(a k )dai ( = 1 and the 
constraint E[a k ] = JJ° o^Pr^) da k = 1 on the expectation value. 
This is a straightforward problem in the calculus of variations, and 
has the simple solution 



Pr(a*) = exp(-a*), 



(21) 



which is, of course, properly normalised. In this paper, we shall use 
this exponential form of the prior. 

We note in passing, however, that in some cases one may not 
only have E[u k ] = 1, but also some a priori expectation on the vari- 
ance of CL k , i.e. some limit on the range of weights that should be 
assigned to each dataset. In this case, we need to assign a prior 
probability distribution Pr(ai l ) by maximising ( ^o| ) subject to the 
constraints Jq Pr(a/ ( ) da^ = 1, E[a k ] = 1 and V[a k ] = a 2 (say). 
Unfortunately, this problem has no solution as stated. Nevertheless, 
a closed-form solution is easily obtained if one does not restrict oc* 
to be non-negative, but instead allows it to take any value between 
— <x and oo (thereby modifying the limits on the integral in ( pp[ ) and 
in the normalisation of Pr(a k )). In this case, the calculus of varia- 
tions problem is easily solved to obtain the Gaussian form 

l) 2 " 



Pr(at) = 



1 



2kg 



exp 



2o 2 



(22) 



which is again properly normalised. Clearly, this form is not strictly 
applicable to weights, which are required to be non-negative. Nev- 
ertheless, provided OC 1, the above Gaussian form will be a good 
approximation to the true prior. 

4.2 The likelihood function for each dataset 

If the hyperparameters a k are to act as weights on the different 
datasets, we specify the log-likelihood function for each dataset to 
have the form 

\nPt{D k \0,a k ) = a k \r&r{D k \e,a k = l)-\zZ k {0,a, k ), (23) 

where Pi(D k \0, a. k = 1) is the standard likelihood function (in the 
absence of hyperparameters) and Z k is a normalisation factor that 
ensures JPi(D k \d,a k )dD k = 1. The expression ( ^3[ ) for the log- 
likelihood clearly corresponds to the likelihood function itself hav- 
ing the form 

[Pr(A|0,a* = l)r 



Pr(D k \6,a k ) 



Z k (0,a k ) 



(24) 



from which we see that the normalisation factor has the explicit 
form 



Z k (0,a k ) 



j [Pr(Z>, 



U0,a* = l) 



dD k 



(25) 



In particular, we note that Z k (6, 1) = 1. From (p3|), we see that 
the standard approach to joint parameter estimation corresponds to 
taking a k = 1 for those datasets that are included in the analysis, 
and = for those that are discarded. Using the prior (pT|), and 
the form of the likelihood for each dataset given in (^4|), we find 
that the new likelihood function for the k dataset is given by 



Pr(D k \0) 



i; 



[Pr(D k \e,a k = l) 

z k (9,a k ) 



da k . 



(26) 
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4.3 The full likelihood and posterior distribution 

Since the datasets D k (k=l,...,N) are assumed to be independent, 
the full likelihood function is given by 



Pr(D\0) = l\Pr(D k \0), 



(27) 



where Pr(D k \0) is given by (Q). The posterior distribution 
Pr(6\D) of the parameters of interest is then given by Bayes' the- 
orem (111). 

For the purposes of illustration, let us suppose finally that the 
prior Pi(6) on the parameters of interest is uniform. In order that 
we may subsequently calculate evidences, we must ensure that this 
prior is properly normalised. Suppose the prior is zero outside some 
(large) region %, is the M-dimensional parameter space. Thus, we 
may define the prior as 



Pr(6») : 



1/V(X) if01iesin3{. 
otherwise, 



(28) 



where V(%.) = J.^ Pi(6) d6 is the 'volume' of the region 

The resulting posterior distribution ~Pi(6\D) can then be cal- 
culated over some M-dimensional hypercube in parameter space, or 
used in the standard way to obtain the estimates 6 and their associ- 
ated errors. We note that, since the hyperparameters a* have been 
marginalised over in (p7[), they do not have specific values that can 
be quoted at the end of the analysis. Nevertheless, it can be useful 
to know which values of a k were most favoured, and hence which 
datasets were given a large or small weight in the analysis. Thus, 
at each point in the space of parameters 6, we define the 'effective' 
weight a e k s (0) to be that which maximises the corresponding in- 
dividual likelihood Pi(D k \6, a*) at that point. Of course, the most 
relevant set of such quantities are those evaluated at the point 6 = 6. 



5 GAUSSIAN LIKELIHOOD FUNCTIONS 

Our analysis has thus far been presented in a general form, which 
may be applied to a wide range of problems in which one wishes 
to weight different datasets. It is quite common, however, for the 
likelihood function to take the form of a multivariate Gaussian in 
the data. In this case, for the Mi dataset, we have 



Pr(A|0,a* = 1) = 7 _ T - 7 i_^exp(- i X f 



(27t)™*/ 2 | V k \V 2 



where 



X 2 k = (D k -» k ) T V k - l (D k -v k ). 



(29) 



(30) 



In these expressions, n k is the number of items in the Mi dataset, 
fi k is the expectation value of the quantities D k and V k is their 
covariance matrix. In general, both fi k and V k may depend on the 
parameters of interest 6. (note that, in general, the likelihood func- 
tion is not a multivariate Gaussian in the parameter space 6, unless 
V k does not depend on the parameters 6 and fj, k depends only lin- 
early on them). 

We note that, even when the form of the likelihood is not 
Gaussian, one can often perform a coordinate transformation to a 
new set of data variables z k , for which the likelihood is (approx- 
imately) Gaussian (see, for example, Bond, Jaffe & Knox 2000). 
Moreover, as shown by Bridle et al. (2001) in the context of CMB 
observations, if the data D k originally obey a Gaussian likelihood 
function of the form (^), then one can perform analytic marginal- 
isations over calibration and 'beam' uncertainties that yield a new 



likelihood function which is also of Gaussian form, but with a mod- 
ified 'covariance' matrix Vl. 

In the standard approach to estimating the parameters 6, one 
assumes a model Hq, in which no hyperparameters are introduced. 
Thus the full likelihood function is given simply by 



Pr(D|Mb) = n 



1 



(31) 



where, on the left-hand side, we have made explicit the condition- 
ing on Hq. One then substitutes this expression into (Q) to obtain 
the corresponding (unnormalised) posterior Pi(8\D,Hq). 

Alternatively, we may adopt the model Hi, in which a free 
hyperparameter a k is assigned as a weight to each dataset. Defining 
the modified likelihood function for each dataset Pr(D k \6,a k ) by 
(|24j), and using ( p5| ) to evaluate the normalistion factor Z k (6,a k ), 
one finds 

Pr(A|0 '°'* ) = ( 2 ^-/2| F ,|i/2 a " /2ex P H a ^) ' P 2 ) 

This result may then be substituted into the expression ( [26] ) to ob- 
tain the new likelihood function for the Mi dataset, which is given 
by 



(33) 



Pr(D k \e) = i — ra" kl2 e- a ^-n+^da k . 

v *' ; (27t)"*/2| V k \V 2 Jo k 

The integral over oc^ can be performed easily, using the definition 
of the Gamma function T(n) = f^f x n ~ l e~ x dx, and one finds 



Pv(D k \6) 



2r(^ + i 



r0d + 2p 2 



+1 



(34) 



which is properly normalised. Thus, assuming the datasets to be 
independent, the full likelihood ( p7[ ) is given by 

N ?r (2t 4- 1 ) Ink , A 

^^U^w 5 ^ ' (35) 

where we have made explicit the conditioning on Hi . As above, 
this form of the likelihood can then be substituted into (Q) to obtain 
the corresponding (unnormalised) posterior Pi(6\D,Hi )■ 

5.1 Evaluation of the posterior and evidences 

Ideally one would wish to calculate the full (unnormalised) posteri- 
ors Pr(#| D.Hq) and Pr(0| D,H\) on some hypercube in parameter 
space. In this way, the location of the (global) maximum is obtained 
immediately, and the presence of multiple peaks in the posterior(s) 
is readily observed. Moreover, marginalised distributions may be 
trivially calculated, in order to place confidence limits on individ- 
ual parameter values. Assuming the uniform prior (^), we see from 
( j3l| ) and (j^) that the evaluation of the posterior for the models Hq 
and H\ respectively requires similar functions to be evaluated. In 
other words, if one has an algorithm for calculating | V k \ and "A, 
as required for the evaluation of the standard posterior (model Hq), 
then one can immediately evaluate the hyperparameters posterior 
(model H\). 

An additional advantageous feature of calculating the full (un- 
normalised) posteriors Pr(6\D ,Hq) and Pr(6\D,Hi) on a hyper- 
cube in parameter space is that it allows the immediate evaluation 
of the evidence in each case, which are given by 



(36) 
(37) 



Pt(D\Hq) = I Pr(6\D,H )d6, 
Pr(D|Hi) = jpi{0\D,Hi)d0. 
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The ratio of these quantities can then be used to decide whether or 
not the inclusion of hyperparameters is warranted by the data. 



5.2 Estimation of parameter values 

Unfortunately, evaluation of the full posterior distribution(s) on a 
hypercube in parameter space is only feasible when the number of 
parameters M is small (although, as mentioned in section 2.3, this 
restriction may be overcome using MCMC sampling techniques). 
For larger problems, the estimates of the parameters of interest 
are usually obtained by maximising the posterior. Assuming the 
uniform prior (|28|), for the standard model Hq (without hyperpa- 
rameters) this corresponds to minimising the function 



-21nPr(0|£>,tf o ) = £ N V k \ +c, 

k=\ V ' 



(38) 



where c is a constant. This is, of course, a very familiar result. For 
the model H\ , however, which contains hyperparameters, one must 
instead minimise 

N r . 

-21nPr(0|£>,ff 1 ) = £ ln| V*| + + 2)ln( X f + 2 ) +c, (39) 

k=l 1 1 

where c is a (different) constant. 

From ([3^), for the model Hq, the parameters estimates 0q may 
be obtained by solving 



£(Vln|V i |+Vxf)=0, 



(40) 



whereas for the model H\, the estimates 9\ are found by solving 
K+2)Vln( X 2 + 2)]=0. (41) 



living 



In the last case, however, we may write 
Vln(3cf+2) = (V3cf)/(3cf + 2). 



(42) 



Thus, if one is able to evaluate the derivatives Vln| VjJ and V%| 
for k = 1, ... , N, to obtain the standard parameter estimates Oq, one 
may easily obtain the estimates 0\ . 

We note, in passing, that in the special case where the covari- 
ance matrices Vj do not depend on the parameters 0, the expres- 
sion ( ^ ) is very similar to that presented in Paper I. In that paper, 
an improper Jeffrey's prior was assumed on the hyperparameters, 
and it was found that the parameter estimates 0\ were given by 
minimising the quantity 



(43) 



ifc=l 



Unfortunately, the corresponding posterior in this case cannot be 
normalised, as a result of using an improper prior. Nevertheless, 
in the limit where the number of data items n k in each dataset is 
large, we would expect %| also to be large. Thus, in this limit, min- 
imisation of the expressions ( p9| ) and ( p^ ) respectively (in the case 
where the V* are independent of the parameters 0) should give al- 
most identical parameter estimates 0\. 

As mentioned in section 4.3, since we have marginalised over 
the hyperparameters (k = l,...,N) in (|3^), we cannot quote a 
value for them at the end of the analysis, but we can determine 
an 'effective' value oc^ ff (0) for each hyperparameter at any point 
in parameter space. In each case, this is given by the value of 
that maximises the modified likelihood Pr(D\0,a/ { ) at that point. 



Differentiating the expression (^) with respect to a k and setting 
the result equal to zero, we find 



af(0) 



(44) 



Clearly, this expression is most meaningful when evaluated at the 
point = 0. 

Finally, it is worth noting the case in which minimisation of 
the functions ( |3^ ) and (^) gives the same parameters estimates 
0. We see that the expressions ( ^p| ) and ([H]) are identical when 
%l(0) = «i for all k. From (p4|), we see that this corresponds to the 
case in which a? ff (0) = 1 for all k. 



5.3 Error estimates and approximate evidences 

The estimated covariance matrix of the parameter errors is given by 
From ( p8[ ) and (p9[), for the models Hq and H\ respectively, we 
have 



r- 1 

°0 



(45) 



J£[vvto|v t |+w x !] e=(V 

k=l 

£ [Win \V k \ + (n k + 2)VVln(x| + 2)] 0= ^ . (46) 



k=l 

Since we may write 

(X* + 2)Wxg-(Vxg)(Vx2) 



Wln^ + 2) 



(X 2 k + 2) 2 



(47) 



we see that, if one can calculate the functions Win | Vjt| and W%? 
(together with V%?, which is required for obtaining the parameter 
estimates) to give the standard (inverse) covariance matrix GT , 
one may easily calculate Cj - . 

Once the (inverse) covariance matrices of the errors have been 
calculated, the evaluation of (approximate) evidences is straightfor- 
ward. Using the approximate expression ( |Io[ ) for the evidence, and 
assuming the uniform prior (|2^), we see that the ratio of evidences 
for the two models Hq and Hi is given by 



Pr(D\Hi) \C 1 \V 2 I>r(D\0 1 ,H 1 ) 



Pi(D\H ) \C o \ l / 2 Pr(D\0 o ,H Q )' 



(48) 



If this ratio is much less than unity, one concludes that the datasets 
are mutually consistent and do not support the introduction of hy- 
perparameters; one should then quote the results of the standard 
analysis, given by the estimates Oq and the covariance matrix Co. 
If the ratio is much greater than unity, however, then this indicates 
that inconsistencies do exist between (some of) the datasets and 
that the inclusion of hyperparameters is warranted; it is then more 
appropriate to quote the estimates 0\ and the covariance matrix C\ . 



6 A TOY PROBLEM 

To illustrate the potential usefulness of the hyperparameters ap- 
proach to weighting datasets, in this section we apply the method 
to the classic parameter estimation problem of fitting a straight line 
through a set of data points. In particular, we will show how the use 
of hyperparameters as weights can overcome the common prob- 
lems of inaccurately quoted error-bars and the presence of system- 
atic errors in the measurements (see also Bridle 2000). 

Let us suppose that the true underlying model for some pro- 
cess is the straight line 
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Figure 1. Left: the two data sets D\ (solid circles) and Dj (open squares) drawn from the straight line model with slope m = 1 and intercept c — 1 (solid line), 
and subject to independent Gaussian noise of rms o"[ =0.1 and o~2 = 0.1 respectively. Middle: the (unnormalised) posterior Pr(0|D.//o) corresponding to the 
standard approach to parameter estimation. Right: the (unnormalised) posterior Vr(G\D ,H\ ) corresponding to the hyperparameters approach. In each case, the 
true parameter values are indicated by the solid circle, and the contours contain 68, 95 and 99 per cent respectively of the total probability. 



y(x) = mx + c, 



(49) 



where the slope m = 1 and the intercept c = 1. We assume that two 
independent sets of measurements D\ and D2 are made, each of 
which contains five data points, so n\ = 5 and 112 = 5. In simu- 
lating each dataset, the x- values are drawn at random from a uni- 
form distribution between zero and unity, and we assume that these 
values are known precisely. The corresponding y-values are inde- 
pendently distributed about their true model values according to 
a Gaussian distribution of known variance for each data set. 
Thus, the likelihood function for the Mi dataset has the Gaus- 
sian form given by (E9|) and where (fi^) j = w(asj) j + c and 
Vk = diag(o|, . . .,&[)■ Thus, in this simple example, the covari- 
ance matrix does not depend on the parameters m and c. 

We now consider three different scenarios, which illustrate 
the general useful properties the hyperparameters technique for 
weighting datasets. We note that, for this simple toy problem, it is 
clear from a plot of the data points if inconsistencies exist between 
two datasets, and we will see that the hyperparameters approach 
verifies our initial suspicions where appropriate. Indeed, this toy 
model was chosen so that inconsistencies revealed by the hyperpa- 
rameters method are transparent from a plot of the data. It should be 
remembered, however, that for realistic cosmological datasets and 
theoretical models, the situation is usually much more complicated. 
In this case, one is often unable visually to discern any inconsisten- 
cies in the data, but these may nevertheless be deduced by using the 
hyperparameters technique. This is illustrated in section ^. 



6.1 Accurate error-bars and no systematic error 

In our first case, both data sets Di and are drawn from the 
correct model m = 1 and c = 1, The noise rms for each dataset is 
o"i = o"2 = 0.1, and these (correct) values are used in the parameter 
estimation process. The two simulated datasets are shown in Fig. [I] 
(left panel), in which the dataset D\ is denoted by the filled circles, 
whereas D2 corresponds to the open squares. The true underlying 
model is plotted as a solid line. 

Assuming an appropriately normalised uniform prior on the 
parameters in the range m : — * 2 and c : — ► 2, and using the 
expressions ( |3l| ) and ( |3^ ) for the likelihood in each case, we 
may calculate the (unnormalised) posteriors Pr{m,c\D ,Hq) and 
Pr(m, c\D,H{) corresponding to the standard and hyperparameters 



approaches respectively (where D denotes the combination of the 
datasets D\ and D^). These posterior distributions are plotted in 
Fig. [j] (middle and right panels), in which the contours contain 68, 
95 and 99 per cent respectively of the total probability; the true 
parameter values are indicated by a solid circle. 

In both cases Hq and H\, the posterior distributions appear 
very similar and contain the true parameter values at about the 68 
per cent confidence level. However, since the prior and the likeli- 
hood function in each case are properly normalised, the greyscale 
units on the plots in Fig. [I] are not arbitrary. Indeed, the evidence 
in each case can be calculated directly by integrating numerically 
under these distributions, and we find 



Pr(D\H ) 



0.54. 



This indicates that standard approach is very slightly preferred or, 
equivalently, that the introduction of hyperparameters is marginally 
disfavoured by the data. An evidence ratio of order unity does not, 
however, provide a strong indication that one case is preferred over 
the other. We may conclude that, in this case, our ability to infer the 
parameter values m and c from the data is essentially unaffected by 
the introduction of hyperparameters. 

It is interesting to compare the exact evidence ratio above with 
that obtained using the approximate expression (^), in which each 
posterior distribution is approximated by a Gaussian centred at its 
peak. Since the function (n% is linear in the parameters m and c, 
and the covariance matrices V* do not depend on these parame- 
ter, the posterior Pt(G\D,Hq) is, in fact, truly Gaussian. We found, 
in this case, that the approximate expression ( |ic| ) for the evidence 
did indeed agree with that obtained by direct numerical integration. 
In the hyperparameters case, however, the posterior Pi(G\DMi ) is 
not Gaussian. Nevertheless, we found that the approximate expres- 
sion ( JToj ) underestimated the true evidence value by only 2 per cent, 
and the approximate evidence ratio was found to be 0.53. 

Finally, we note the effective weight assigned to each dataset 
at the peak 6\ of the hyperparameters posterior Pi(6\D ,Hi ). These 
values are given by ( M4p evaluated at this point, and we find 



af (0i 



1.79 and af 



(0^ 



2.64. 
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Figure 2. As for Fig. [l], but the reported error-bars on the dataset D\ have been underestimated by a factor of 5. 



6.2 Inaccurate error-bars and no systematic error 

In this case, the datasets D\ and Di are identical to those used in 
the previous subsection, but we assume that the quoted errors-bars 
on the dataset D\ have been underestimated by a factor of 5, while 
the error-bars on the dataset D2 are quoted correctly. Thus, in the 
parameter estimation procedure, we assume the (incorrect) values 
Gi =0.02 and 02 =0.1. The data points and their reported error- 
bars are shown in Fig. ^ (left panel). 

The resulting (unnormalised) posteriors Pr(6\D,Ho) and 
Fv(9\D ,Hi) are shown in the middle and right-hand panels of 
the figure. In this case, the two posteriors are very different. In 
the standard approach Hq, the posterior distribution is tightly con- 
strained about its maximum as a result of underestimating the errors 
on dataset D\ . Indeed, this posterior is virtually indistinguishable 
from that calculated from the dataset D\ alone. The true parame- 
ter values are excluded at a confidence level that far exceeds the 
99 per cent limit. In the hyperparameters case, however, the poste- 
rior is much broader and resembles the corresponding distribution 
in Fig. |l| In particular, the true parameter values are comfortably 
contained within the 95 per cent confidence limit. 

Integrating under the posterior distributions directly, the exact 
evidence ratio is found to be 

H T\ =2-i xio 4 , 

Pr(D\H ) 

which clearly implies that the data favour the introduction of hy- 
perparameter weights. Using the expression (p^|), the approximate 
evidence ratio is 1.6 x 10 4 , which shows that the Gaussian approx- 
imation to the hyperparameters posterior is once again reasonably 
accurate. 

At the peak 6\ of the hyperparameters posterior, we find the 
effective weights assigned to the two datasets D\ and D2 to be 
afidi) = 0.12 and aSf^i) = 1.28. Thus, the first dataset (with 
error-bars underestimated by a factor of 5) has been assigned an 
appropriate smaller statistical weight. 



6.3 Accurate error-bars and a systematic error 

We now consider the introduction of a systematic error into the 
dataset D\ . This is simulated by drawing this dataset from an in- 
correct straight-line model, for which the parameter values m and 
c differ from unity. Dataset D2, however, is still drawn from the 
correct straight-line model with m = 1 and c = 1. We shall, in fact, 
consider two separate complementary cases. In the first case, we 



introduce a systematic error in the direction of the natural degen- 
eracy line in the (m.c)-plane, whereas, in the second case, the in- 
troduced systematic error is orthogonal to the natural degeneracy 
line. In both cases, we assume that the error-bars on each dataset 
are quoted accurately as Oi =0.1 and 02 = 0.1 respectively. 

6.3.1 Case 1 

In our first case, dataset D\ is drawn from a straight line with slope 
m = and intercept c = 1.5. The datasets D\ and Di are shown in 
Fig. ^ (left panel), together with the underlying straight-line mod- 
els from which each is drawn. The resulting posterior distributions 
Pr(G\D ,Hq) and ~Pr(6\D ,H\) are shown in the middle and right- 
hand panels of the figure. Once again, the two posteriors are very 
different. In the standard approach, Hq, the posterior distribution 
peaks between the two sets of true values. In spite of the fact that 
the two sets of parameter values define a direction along the natural 
degeneracy line in the (m,c) -plane, neither is contained within the 
99 per cent confidence contour, and so both models are excluded 
at a high significance level. In the hyperparameters case, however, 
the posterior is much further extended along the natural degener- 
acy line. In particular, we note that the distribution is bimodal, with 
each peak lying close to one of the true sets of parameter values. 
Thus, the hyperparameters indicates the presence of two underly- 
ing models for the data, which signals an inconsistency between 
the two datasets. This could be interpreted as one (or both) of the 
datasets containing a systematic error. 

The exact evidence ratio in this case is found to be 

Pr(fl|flj) 
Pr(D\H ) L1 - ' 

which gives a reasonably robust indication that the data favour the 
introduction of hyperparameters. Using the expression (p^[), the ap- 
proximate evidence ratio is given by 4.8. The reason for the large 
inaccuracy in this case is that the Gaussian approximation to the 
bimodal hyperparameters posterior is clearly rather poor. In fact, as 
might be expected in this case, the Gaussian approximation under- 
estimates the true value of the evidence by about a factor of two. 

Although the hyperparameters posterior is bimodal, the global 
maximum 6\ occurs at the peak close to the parameter values from 
which the dataset D\ was drawn. At this peak, we find the effective 
weights assigned to the two datasets D\ and D2 are 0C[ ff (#i) = 
3.11 and a^(8i) = 0.18, which indicates (correctly) that the first 
dataset has been assigned a larger statistical weight at this point in 
parameter space. However, at the subsidiary peak 0\ located near 
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Figure 3. Left: the dataset Di (solid circles) drawn from the straight-line model with slope m = and intercept c = 1.5 (dashed line) and the dataset D2 (open 
squares) drawn from the straight-line model with slope m = 1 and intercept c = 1 (solid line); each dataset is subject to independent Gaussian noise of rms 
o"i =0.1 and 02 = 0.1 respectively. Middle: the (unnormalised) posterior Pr(0|D ,Hq) corresponding to the standard approach to parameter estimation. Right: 
the (unnormalised) posterior ¥r(d\D ,H{ ) corresponding to the hyperparameters approach. In each case, the true parameter values for dataset D{ are indicated 
by the solid circle and the true parameter values for dataset Di are indicated by an open square. The contours in each case contain 68, 95 and 99 per cent 
respectively of the total probability. 




Figure 4. As for Fig. |^, except that the dataset D\ (solid circles) is drawn from the straight-line model with slope m = 0.7 and intercept c = 0.7. 



the parameter values from which the dataset D2 was drawn, we 
find af(6[) =0.11 and af(§i) = 6.41, and so the roles of the 
datasets have been reversed. 



6.3.2 Case 2 

In our second case, dataset D\ is drawn from a straight line with 
slope m = 0.7 and intercept c = 0.7. The datasets D\ and D2 are 
shown in Fig. Q (left panel), together with the underlying straight- 
line models from which each is drawn. The the middle and right- 
hand panels of the figure show the resulting posterior distributions 
Pr(0 1 D , H ) and Pr(0 1 D , Hi ) . As in Case 1 , the standard approach 
produces a posterior distribution that peaks between the two true 
sets of parameters values, excluding both at a high significance 
level. We note, in this case, that the two sets of true parameter val- 
ues define a direction orthogonal to the natural degeneracy line in 
the (m,c)-plane. In the hyperparameters case, however, the poste- 
rior is again bimodal, with each peak lying close to one of the true 
sets of parameter values. Thus, once again the hyperparameters ap- 
proach signals the presence of two underlying models for the data 
and hence an inconsistency between the datasets. 
The exact evidence ratio is found to be 



5^=5.9x103, 
Pr(D\H ) 

which strongly implies that the data favour the introduction of hy- 
perparameters. In this case, the approximate evidence ratio ( jl8| ) is 
given by 2.4 x 10 3 . As in Case 1, the Gaussian approximation un- 
derestimates the true value of the evidence for the hyperparameters 
posterior by about a factor of 2, as a result of it being bimodal. 

Once again the effective weights clearly show which dataset 
is dominating at each peak of the hyperparameters posterior. At the 
global maximum 0, , we find af ff (0i ) = 0.066 and a| ff (#i ) = 7.96, 
whereas at the subsidiary maximum 6\ we obtain af{0\) = 3.31 
and CXif (0i ) =0.079. 



7 A COSMOLOGICAL ILLUSTRATION 

In this section, we illustrate the hyperparameters technique for 
weighting datasets by applying it to recent measurements of the 
CMB power spectrum provided by the BOOMERANG (Netterfield 
et al. 2001), MAXIMA (Lee et al. 2001) and DASI (Halverson et al. 
2001) experiments. We also include the 8 data points for large-scale 
normalisation of the spectrum provided by the COBE satellite. The 
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Figure 5. The flat-band power estimates of the CMB power spectrum 
reported by the COBE, BOOMERANG, MAXIMA, DASI experiments. 
The solid line corresponds to the CMB power spectrum for a spatially- 
flat inflationary CDM model with no tensor contribution and Q.,„ = 0.3, 
Q-bh 1 = 0.02, h = 0.7, n = 1, % = and Q = 18 fiK. 



flat band-powers reported by each of these experiments are plot- 
ted in Fig pi The solid line in the plot corresponds to the predicted 
CMB power spectrum for a spatially-flat inflationary CDM model 
with no tensor contribution and £l m = 0.3, Q.i,h 2 = 0.02, h = 0.7, 
n = l,x = 0andQ = 18/iK. 

From Fig. ^, we see immediately that there is good agree- 
ment between the datasets, all of which are broadly consistent with 
the plotted theoretical CMB power spectrum. One would therefore 
expect the inclusion of and marginalisation over hyperparameter 
weights not to be warranted by the data. It should also be remem- 
bered, however, that the points and error-bars plotted in the figure 
do not include the calibration and beam uncertainties associated 
with each experiment. Indeed, since these uncertainties can be as 
large as 20 per cent in some cases, one would naively expect the 
envelope of the experimental data to be much wider, resulting in 
poorer agreement with the theoretical spectrum. As we will see be- 
low, this naive observation is supported by the proper calculation 
of Bayesian evidences. 

7.1 No calibration or beam uncertainty 

We first analyse the data without including calibration and beam 
errors, i.e. as plotted in Fig. |5| For each dataset (k= 1, 2, 3,4), 
we assume the uncertainties on the flat band-powers are described 
by a multivariate Gaussian of the form given in (^). For the COBE 
and DASI datasets, we use the publicly-available window functions 
and full covariance matrix for each dataset. In the absence of the 
equivalent information for the BOOMERANG and MAXIMA ex- 
periments, we assume a top-hat window function for the spectrum 
in each bin and neglect correlations between bins. This approach 
is a good approximation of the correct one (see de Bernardis et al. 
2001) and does not affect our conclusions. 

Denoting the totality of the resulting data by the vector D, the 
next step in the analysis is to calculate the unnormalised posterior 
distributions ~Pi(8\D ,Hq) and Pr(0\D ,Hi), corresponding to the 
standard and hyperparameters approaches respectively, for some 
set of cosmological parameters 9. For the purposes of illustration, 
we assume a spatially-flat Universe (flj- = 1), with no tensor con- 
tribution to the CMB spectrum, and take the parameter vector 6 to 



consist of 5 parameters, namely the normalisation Q, the Hubble 
parameter h, the matter density £2 m , the physical baryon density 
Q.^h 2 and the scalar spectral index n. Since the parameter space is 
only 5-dimensional, it is feasible to evaluate the unnormalised pos- 
teriors over a hypercube. Thus, we assume (suitably normalised) 
uniform priors on the parameters, with the ranges 12 < Q < 24 
piK, 0.5 < h < 0.9, 0.1 < n m < 0.5, 0.01 < Sl h h 2 < 0.04 and 
0.8 < n < 1.2. The corresponding likelihood functions for the hy- 
potheses Hq and H \ have the forms (jHJ) and ( j35j ) respectively. The 
posterior distribution is calculated at 20 points in the directions Q, 
h and Sl m in parameter space, and at 10 points in the directions n 
and £li,h 2 . The theoretical power spectrum corresponding to each 
model was calculated using CAMB (Lewis, Challinor & Lasenby 
2000). _ _ 

Since ~Pi(8\D,Hq) and Vt{0\D,H\) are each calculated over 
a hypercube in the 5-dimensional parameter space, one can calcu- 
lated the evidence integrals (^) and ( ^7| ) directly, and one finds the 
exact evidence ratio to be 

PrCDjgi) 
Pr(D\H ) 



= 0.05. 



This indicates that the data do not support the inclusion of and 
subsequent marginalisation over a free hyperparameter weight 
(k = 1,2,3,4) for each dataset. In other words, as expected, the 
datasets are statistically consistent both with one another and with 
the range of theoretical models with which they have been com- 
pared. 

For illustration, let us consider the constraints placed by the 
current CMB data on the parameters n and Cl^h 2 with and with- 
out the inclusion of hyperparameter weights. The limits placed on 
these parameters by the standard and hyperparameters approaches 
respectively are easily obtained by marginalising the correspond- 
ing posteriors Pr(0|13,//o) and Pr(6\D,H\) over the remaining 
parameters Q, h and Cl m . The resulting distributions are shown in 
Fig. [I together with corresponding the 68 and 95 per cent confi- 
dence contours. Although the above evidence ratio shows that the 
data do not favour marginalisation over hyperparameter weights, 
we see that their inclusion does not significantly affect the con- 
straints placed on the parameters. This illustrates that, for statisti- 
cally consistent datasets, the standard and hyperparameters tech- 
niques may be use interchangeable without degrading the con- 
straints on cosmological parameters. In particular, we note that the 
current CMB data are consistent with the scale invariant spectrum 
n — 1 and the nucleosynthesis constraints on Q.j,h 2 . Indeed, by per- 
forming an additional marginalisation over n or Cl^h 2 we obtain (to 
two significant figures) the one-dimensional 68-per cent confidence 
intervals 0.019 < Q b h 2 < 0.023 and 0.92 < n < 1.00 for both the 
case Hq and Hi . 



7.2 Marginalisation over calibration and beam uncertainties 

So far our analysis has ignored the calibration and beam uncertain- 
ties associated with each dataset. Any meaningful analysis of these 
data must, however, include these effects of these 'nuisance' pa- 
rameters. As with hyperparameter weights, the correct approach is 
to assign some prior to these parameters and then marginalise over 
them. For calibration and beam uncertainties, one does have a priori 
knowledge of both the expectation value of each parameter and the 
range of values it might take (i.e. its variance). Thus, as shown in 
section ^j], it is appropriate to adopt independent Gaussian priors 
on the calibration and beam uncertainties. As discussed in detail 
by Bridle et al. (2001), for Gaussian likelihood functions of the 
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Figure 6. The standard posterior Yr{n,Q,i,h-\D ,Hq) (left) and the hyperparameters posterior Yv(nXlbh-\D ,H[ ) (right) obtained by marginalisation over the 
parameters Q, h and Q m , assuming no calibration or beam uncertainties in the data plotted in Fig. n 



form (g9p, one can again perform this marginalisation analytically. 
Moreover, the resulting likelihood function after marginalisation is 
also of the form (|2^), but with a modified covariance matrix V^. 

Using these modified datasets, one can then calculate the pos- 
teriors Pi(8\D ,Hq) and ~Pi(6\D ,H\) over the same hypercube in 
the 5-dimensional parameter space as used above, and evaluate the 
evidence integrals ( |36| ) and directly. In this case, the evidence 
ratio is found to be 

Pr(D\Hi 



Pr(D\H ) 



= 0.08, 



which illustrates that the inclusion of and marginalisation over hy- 
perparameter weights is once more unwarranted. The correspond- 
ing marginalised posteriors in the («,iii/i 2 )-plane, with and with- 
out marginalisation over are again almost identical and closely re- 
semble those plotted in Fig. |^ One obtains the one-dimensional 
68-per cent confidence intervals 0.018 < £lt,h 2 < 0.024 and 0.90 < 
n < 0.99 for the case Ho (without hyperparameter weights) and 
0.018 < 0. h h 2 < 0.024 and 0.90 < n < 1.00 for the case Hi (with 
hyperparameter weights). Once again, we see that the inclusion of 
and marginalisation over hyperparameter weights has little effect 
on the parameter constraints imposed by the data. 





No HP weights 


With HP weights 


No cal/beam marginalisation 


1.0 


5> 


< 1(T 2 


With cal/beam marginalisation 


4x 1(T 4 


3> 


< 1(T 5 



Table 1. The relative Bayesian evidences for the obtaining the current CMB 
power spectrum data in the 4 cases discussed in the text. 



the marginalisation over the calibration and beam uncertainties also 
reduces the evidence for the data. This means that, when compared 
with the range theoretical models discussed above, the probabilty of 
obtaining the observed data is highest if one assumes no uncertainty 
exists in the calibration and beam properties adopted by each exper- 
iment, rather than marginalising over the reported uncertainties in 
these parameters. Thus we see that the evaluation of evidences has 
provided quantitative support for our earlier naive observation that 
the CMB datasets plotted in Fig. ^| agree remarkably well given the 
5-20 per cent calibration uncertainties alone (which are not plot- 
ted). We note here simply that this eventuality is unlikely to occur 
by chance. 



7.3 Evidence for calibration and beam uncertainties 

Although the evidence ratios given above show that the current 
CMB power spectrum data clearly do not require the use of hy- 
perparameter weights in their analysis, it is also interesting to com- 
pare the relative evidence values for the cases with and without 
marginalisation over calibration and beam uncertainties (for both 
Ho and Hi). The evidence values corresponding to each of the 4 
cases we have considered are given in Table 111 where they have 
been rescaled so that the evidence is unity for the case where 
no hyperparameter weights are introduced and no marginalisation 
over calibration and beam uncertainties is performed. We see that 
the largest evidence value is obtained for the case in which no 
marginalisation is performed over hyperparameter weights or cal- 
ibration and beam uncertainties. While one might expect from the 
data plot in Fig. | that the introduction of and marginalisation over 
hyperparameter weights might not be favoured, it is surprising that 



8 CONCLUSIONS 

We have presented a general account of the use of hyperparameters 
in the analysis of cosmological datasets. In particular, we have con- 
centrated on applying the hyperparameters technique to the prob- 
lem of weighting different datasets in a joint analysis aimed at esti- 
mating some set of cosmological parameters. The basic approach is 
to assign a free hyperparameter weight to each dataset and then per- 
form a marginalisation over these hyperparameters. This method 
allows the statistical properties of the data themselves to determine 
the effective weight given to each dataset and is in sharp contrast to 
the more common practice of excluding certain datasets from the 
analysis, which hence are assigned a weight of zero, and analysing 
the remaining datasets with equal weights. 

Assuming the expected value of the hyperparameter weight on 
each dataset to be unity, we find that the prior on each hyperparam- 
eter should be of exponential form, rather than the more commonly 
used improper Jeffrey's prior. In either case, the marginalisation 
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over the hyperparameters may be performed analytically for Gaus- 
sian likelihood functions. Since the exponential prior is properly 
normalised, in this case one may also calculate the Bayesian evi- 
dence for the data given the hyperparameters model. This evidence 
value can then be compared with the corresponding evidence for 
the data in the absence of hyperparameters, in order to determine 
whether the data warrant the inclusion of hyperparameters in the 
first instance. In each case, the evidence may be calculated either 
by direct integration, for low-dimensionality parameter spaces, or 
approximated straightforwardly by assuming each posterior to be 
Gaussian near its peak. 

The hyperparameter approach to weighting datasets is illus- 
trated by applying it to the classic toy model of fitting a straight to 
a number of datasets. We find the hyperparameters technique cor- 
rectly infers the existence of systematic errors and/or misquoted 
random errors in the datasets. In such cases, the evidence ratio for 
the hyperparameters and standard approaches clearly indicates that 
the data warrant the introduction of the hyperparameters. Neverthe- 
less, in the case where no systematic errors exists, and the random 
errors are accurately quoted, the evidence ratio correctly indicates 
that hyperparameters should not be included in the analysis. 

Finally, the hyperparameters technique is applied to the lat- 
est measurements of the cosmic microwave background (CMB) 
power spectrum by the BOOMERANG, MAXIMA and DASI ex- 
periments, together with the large-scale normalisation of the power 
spectrum provided by the COBE satellite. The evidence ratio be- 
tween the hyperparameters and standard approach shows that the 
current CMB datasets do not warrant the inclusion of hyperparam- 
eters weights and a subsequent marginalisation over them. In other 
words, the datasets are statistically consistent both with one another 
and with the range of theoretical models with which they were com- 
pared. Nevertheless, the inclusion of hyperparameters in shown not 
to lessen the constraints imposed by the data on cosmological pa- 
rameters. 

The analysis is repeated for the case in which one first 
marginalises over the beam and calibration uncertainties associated 
with each dataset, and again the evidence ratio shows that the inclu- 
sion of hyperparameter weights is unwarranted. Of more interest is 
the comparison of evidences with and without marginalisation over 
calibration and beam uncertainties. This shows that the evidence for 
the data is largest if one assumes no uncertainty in the calibration 
and beam properties, but instead adopt the values corresponding 
to those implied by the published CMB power spectrum points for 
each dataset. This is unlikely to occur by chance. 
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